System and User Interface Supporting User Navigation of Multimedia Data File Content

This application claims the benefit of U.S. Provisional
Application No. 60/256,293, filed December 18, 2000.

BACKGROUND OF THE INVENTION 

1. Field of the Invention:

The present invention is concerned with processing
multimedia data files to provide information supporting
user navigation of multimedia data file content.

2. Background of the Invention
The demand for hypermedia applications has increased
with the growing popularity of the World Wide Web. As a
result, a need for an effective and automatic method of
creating hypermedia has arisen. However, the creation of
hypermedia can be a laborious, manually intensive job. In
particular, hypermedia creation can be difficult when
referencing content in documents including images and/or
other media.

In many cases, hypermedia authors need to locate
Anchorable Information Units (AIUs) or hotspots, which are
areas or keywords of particular significance, and make
appropriate hyperlinks to relevant information. In an
electronic document, a user can retrieve associated
information by selecting these hotspots as the system
interprets the associated hyperlinks and fetches the
corresponding relevant information.

Previous research in this field has taken scanned
bitmap images as the input to a document analysis system.
The classification of the document system is often guided
by a priori knowledge of the document's class. There has
been little work done in using postscript files as a
starting point for document analysis. Certainly, if a
postscript file is designed for maximum raster efficiency,
it can be a daunting task even to reconstruct the reading
order for the document. Previous researchers may have
assumed that a well-structured source text will always be
available to match postscript output and therefore working
bottom-up from postscript would seldom be needed. However,
PDF documents can be generated in a variety of ways
including an Optical Character Recognition (OCR) based
route directly from a bit-mapped page. The extra structure
in PDF, over and above that in postscript, can be utilized
towards the goal of document understanding.

Previous work proposed methods related to the
understanding of raster images. Being an inverse problem by
definition, this task cannot be accomplished without making
broad assumptions. Directly applying these methods on PDF
documents would make little sense as they are not designed
to make use of the underlying structure of PDF files, and
thus will produce undesirable results.

In contrast to geometric layout analysis, logical
layout analysis has received very little attention. Some
methods of logical layout analysis perform region
identification or classification in a derived geometric
layout. However, these approaches are primarily rule based
and thus, the final outcome depends on the dependability of
the prior information and how well the prior information is
represented within the rules.

Systems such as Acrobat do not have the ability to
process images. Rather, Acrobat runs the whole document
through an OCR system. Clearly, OCR is not able to extract
objects, but even in the case of understanding text the
output can be unreliable, as a general-purpose OCR can be
error prone when used to understand scanned-in images
directly.

Therefore, a need exists for a method of analyzing and
extracting text from PDF documents created using various
means.

SUMMARY OF THE INVENTION 

According to an embodiment of the present invention, a
system is provided for processing a multimedia data file to
provide information supporting user navigation of
multimedia data file content. The system includes a content
parser to identify text and image content of a data file,
and an image processor for processing said identified image
content to identify embedded text content. The system
further includes a text sorter for parsing said identified
text and said identified embedded text to locate text items
in accordance with predetermined sorting rules, and memory
for storing a navigation file containing said text items.
The navigation file links to at least one internal
document object. The navigation file links to at least one
external document object.

The image processor includes a black and white image
processor including a pixel smearing component reducing
text to a rectangular block of pixels, and an image
filtering component for cleaning a smeared image.

The content parser applies text extraction rules to
identify text and identify a document structure, wherein
the document structure defines a context for identified
text. The content parser applies pre-defined hierarchical
rules for determining a level of identified text.

The image processor applies object templates to
identify embedded text.

The system refines a search resolution during a text
identifying process to determine a location of the embedded
text within an image.

Identified text comprises hyperlinks. 

According to another embodiment of the present
invention, a graphical User interface system is provided
supporting processing of a multimedia data file to provide
information supporting user navigation of multimedia data
file content. The graphical User interface system includes
a menu generator for generating one or more menus
permitting User selection of an input file and format to
be processed, and an icon permitting User initiation of
generation of a navigation file supporting linking of input
file elements to external documents by parsing and sorting
text and image content to identify text for incorporation
in a navigation file.

Identified text comprises hyperlinks. 
The navigation file further comprises links to at
least one internal document object.

According to an embodiment of the present invention, a
method is provided for creating an anchorable information
unit in a portable document format document. The method
includes extracting a text segment from the portable
document format document, determining a context of the
segment, wherein the context is selected from a context
sensitive hierarchical structure, and defining the text
segment as an anchorable information unit according to the
context.

The portable document format document includes one or
more textual objects and one or more non-textual objects,
wherein the objects include textual segments.

Determining the context includes comparing the text
segment to a plurality of known patterns within the
portable document format document, and determining the
context upon determining a match between the text segment
and a known pattern of the portable document format
document.

Extracting text further includes extracting text from
an image of the portable document format document,
determining an image type, wherein the type is one of a
black and white image, a grayscale image, and a color
image, and processing the image according to the type.

The portable document format document includes a known
context sensitive hierarchical structure. The context
sensitive hierarchical structure, including the anchorable
information unit, is searchable. The context includes a
location of the extracted text segments. Determining the
context includes determining a location and a style of the
text segment.

The method further includes storing the text segment 
in a Standard Generalized Markup Language syntax using a 
predefined grammar. 

The anchorable information unit is automatically
hyperlinked.

According to an embodiment of the present invention, a
method is provided for creating an anchorable information
unit file from a portable document format document. The
method includes parsing the portable document format
document into textual portions and non-text portions. The
method further includes extracting structure from the
textual portions and the non-text portions, and determining
text within the textual portions and text within the
non-text portions. The method hyperlinks a plurality of
keywords within the textual portions and non-text portions
to at least one related document.

Parsing further comprises the step of differentiating
color image content, black-and-white content, and grayscale
content.

Extracting further comprises determining a level for
extracted textual portions, associating the context with
the text, and pattern matching extracted text to the
portable document format document to determine a context.
The level is one of a paragraph, a heading and a
subheading. Pattern matching includes determining a median
font size for the portable document format document,
comparing a font size of the extracted text to the median
font size for the portable document format document, and
determining a context according to font size.

Hyperlinking includes creating the anchorable
information unit file, wherein the plurality of keywords
are anchorable information units.

According to an embodiment of the present invention, a
program storage device is provided, readable by machine,
tangibly embodying a program of instructions executable by
the machine to perform method steps for creating an
anchorable information unit file from a portable document
format document.

BRIEF DESCRIPTION OF THE DRAWINGS 
Preferred embodiments of the present invention will be
described below in more detail, with reference to the
accompanying drawings:

Fig. 1 is a flow chart showing an overview of a method of
creating an anchorable information unit according to an
embodiment of the present invention;

Fig. 2 is a flow chart showing a method of creating an
anchorable information unit according to an embodiment of the
present invention;

Figs. 3a-b are a flow chart showing a method of creating an
anchorable information unit according to an embodiment of the
present invention; and

Fig. 4 shows a graphical User interface display supporting
processing of a multimedia data file to provide information for
use in navigating multimedia data file content, according to an
embodiment of the present invention.


DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The present invention provides an automated method for
locating hotspots in a PDF file, and for creating
cross-referenced AIUs in hypermedia documents. For example,
text strings can point to a relevant machine part in a
document describing an industrial instrument.

It is to be understood that the present invention may
be implemented in various forms of hardware, software,
firmware, special purpose processors, or a combination
thereof. In one embodiment, the present invention may be
implemented in software as an application program tangibly
embodied on a program storage device. The application
program may be uploaded to, and executed by, a machine
comprising any suitable architecture. Preferably, the
machine is implemented on a computer platform having
hardware such as one or more central processing units
(CPU), a random access memory (RAM), and input/output (I/O)
interface(s). The computer platform also includes an
operating system and micro instruction code. The various
processes and functions described herein may either be part
of the micro instruction code or part of the application
program (or a combination thereof) which is executed via
the operating system. In addition, various other peripheral
devices may be connected to the computer platform such as
an additional data storage device and a printing device.

It is to be further understood that, because some of
the constituent system components and method steps depicted
in the accompanying figures may be implemented in software,
the actual connections between the system components (or
the process steps) may differ depending upon the manner in
which the present invention is programmed. Given the
teachings of the present invention provided herein, one of
ordinary skill in the related art will be able to
contemplate these and similar implementations or
configurations of the present invention.

The PDF files under consideration can include simple
text, or more generally, can include a mixture of text and
a variety of different types of images such as black and
white, grayscale and color. According to an embodiment of
the present invention, the method locates the text and
non-text areas, and applies different processing methods to
each. For the non-text regions, different image processing
methods are used according to the type of images contained
therein.

The extraction of AIUs is important for the generation
of hypermedia documents. However, for some PDF files, e.g.,
those that have been scanned into a computer, this can be
difficult. According to an embodiment of the present
invention, the method decomposes the document to determine
a page layout for the underlying pages. Thus, different
methods can be applied to the different portions of a page.
A geometric page layout of a document is a specification of
the geometry of the maximal homogeneous regions and their
classification (text, table, image, drawing, etc.). Logical
page layout analysis includes determining a page type,
assigning functional labels such as title, note, footnote,
caption, etc., to each block of the page, determining the
relationships of these blocks and ordering the text blocks
according to a reading order.

OCR has had an important role in prior art systems for
determining document content. Accordingly, OCR has received
most of the research focus. Page segmentation plays an
important role in this domain because the performance of a
document understanding system as a whole depends on the
preprocessing that goes in before the OCR.

The present invention analyzes the document and
extracts information from the text and/or figures that can
be located anywhere within the document. The method
determines the context in which these hotspots (e.g.,
objects or text segments of interest) appear. Further, the
method saves this information in a structured manner that
follows a predefined syntax and grammar that allows the
method to refer to that information while creating
automatic hyperlinks between different documents and media
types.

A flow chart showing the main stages in the graphics
recognition process is shown in Fig. 1. The input to the
system includes a PDF file 101. The method parses the file
into areas of text and non-text 102. The text and non-text
regions are analyzed to extract structure and other
relevant information 103. The method determines text within
regular text blocks 104, as well as text within the images
105-108 (if any), such as item numbers within an
engineering drawing. The method distinguishes between color
images and black and white images 105 before extracting
text from an image. These text segments are used for
hyperlinking with other documents 109-110, for example,
another PDF file or any other media type such as audio,
video, etc.

In order to help application programmers extract words
from PDF files, Adobe Systems provides a software
development kit (SDK) that gives access, via the
application programmer's interface (API) of Acrobat®
viewers, to the underlying portable document model, which
the viewer holds in memory. The SDK is able to conduct a
search for PDF documents. For PDF documents that are
created directly from a text editor such as Microsoft's
Word or Adobe's FrameMaker®, this works quite well;
however, for scanned-in documents, the performance can
decrease significantly. Additionally, for double-columned
documents, the SDK can be error prone. The SDK was designed
primarily for documents created using a text editor.
Therefore, performance with documents created by other
means was not an important issue. The present invention
uses an alternative strategy for scanned-in documents.

According to an embodiment of the present invention,
the method extracts words along with their location in the
document, and the style used to render them. The method not
only determines whether a certain word exists in a page or
not, but also determines the location and the context in
which it appears, so that a link can be automatically
created from the location to the same media or a different
one based on the content.
Referring to Fig. 2, the method extracts 202 text, the
coordinates of the text, and the text style from a PDF file
201. The method analyzes parameters of the PDF file to
determine the context in which the text appears 203-205.
The parameters include, inter alia, paragraphs 203,
headings 204, and subheadings 205. The method further
extracts text and associated bounding boxes, and page
numbers. The parameters of a bounding box are determined
from the extracted coordinates. The method associates
context with text 206. For example, if the bounding box is
aligned horizontally with several other words, e.g., if the
text appears at similar heights and is part of a larger
group, then the method determines this text to be part of
regular text (e.g., a paragraph) for the page, as opposed
to, for example, a heading.

The method determines the median font size for a
portion of the text document and performs context sensitive
pattern matching 207. If the font size for a portion of
text is larger than the median, and if the text portion is
small, e.g., the text does not extend more than a single
line, the method determines this to be part of a heading.
Upon determining a heading, the method checks the text
level, e.g., whether it belongs to a chapter heading, a
section heading, a subsection, etc. The text level can also
be determined from the relative font sizes used and offsets
from the right or left margin, if any.
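The median-font-size test described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the described method: the `TextSegment` record, its field names, and the rule of ranking heading levels purely by descending font size are assumptions made for the example, and margin offsets are ignored.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextSegment:
    """Hypothetical record for one extracted text segment."""
    text: str
    font_size: float
    line_count: int  # number of lines the segment spans


def classify_segments(segments):
    """Label a segment 'heading' if its font is larger than the median
    font size and it fits on a single line; otherwise 'paragraph'."""
    if not segments:
        return []
    med = median(s.font_size for s in segments)
    return [
        "heading" if s.font_size > med and s.line_count <= 1 else "paragraph"
        for s in segments
    ]


def heading_levels(segments, labels):
    """Assumed level rule: the largest heading font is level 1, the next
    largest level 2, and so on. Paragraphs get None."""
    sizes = sorted(
        {s.font_size for s, label in zip(segments, labels) if label == "heading"},
        reverse=True,
    )
    rank = {size: i + 1 for i, size in enumerate(sizes)}
    return [
        rank[s.font_size] if label == "heading" else None
        for s, label in zip(segments, labels)
    ]


segments = [
    TextSegment("1. Overview", 18.0, 1),
    TextSegment("Body text ...", 10.0, 4),
    TextSegment("1.1 Scope", 14.0, 1),
    TextSegment("More body text ...", 10.0, 6),
    TextSegment("Even more body text ...", 10.0, 3),
]
labels = classify_segments(segments)
levels = heading_levels(segments, labels)
```

In this example the median font size is 10, so the two single-line segments set in larger type are treated as headings at levels 1 and 2.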

Once the method has determined all the text
information regarding the organization of the document, the
method uses organization information to selectively create
Anchorable Information Units (AIUs) 208-209 or hotspots.
The method automatically or semi-automatically creates
these hotspots in a context sensitive non-redundant manner
based on the organization information.

The present invention provides a method for extracting
images. What makes this problem challenging is that text
may not be distinguished from polylines, which constitute
the underlying line drawings. While developing a general
method that would work for all kinds of line-drawing images
is difficult, the present invention makes use of underlying
structures of the concerned documents. The present
invention localizes images according to the geometry and
length of the text strings. These localized regions are
analyzed using OCR software to extract the textual content.

Referring to Figs. 3a and 3b, the method extracts
images and their location 302 from a PDF file 301. In PDF
files, various types of images can be encoded, including
black and white, grayscale and colored images. Objects of
interest can be encoded in any of these images. For
example, a black and white image can be used to encode a
computer aided design (CAD) drawing. CAD images can
include, for example, diagrams of predefined objects or
text segments that may refer to important information, such
as machine parts. Other images can include, for example,
descriptions of machine parts, especially if the documents
are of an engineering nature.
In PDF, an image is called an XObject, whose subtype
is Image. Images allow a content stream to specify a
sampled image or image mask. The method determines the
type of image 303. PDF allows for image masks, e.g., 1-bit,
2-bit, 4-bit and 8-bit grayscale images and color images
with 1, 2, 4 or 8 bits per component. An image mask, such
as an external image, can be embedded within the PDF file.
For embedded images, the method determines a reference to
that image, and based on the type of image and the file
format, an appropriate decoding technique can be used to
extract the image and process it 304. However, if it is a
sampled image, then the image pixel values are stored
directly within the PDF file in a certain encoded fashion.
The image pixel values can be first decoded and then
processed 305.

The method simplifies the images to extract text
strings 306. The grayscale images are converted to black
and white images by thresholding 307. The method looks for
text strings in either grayscale or black/white images.
Thus, if the image is non-colored, it is reduced to black
and white.
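The thresholding step can be illustrated with a minimal sketch. It assumes 8-bit gray levels (0 = black, 255 = white), a fixed threshold of 128, and a nested-list image representation; none of these values is specified in the text, and a real system might pick the threshold adaptively.

```python
def threshold_grayscale(image, threshold=128):
    """Convert a grayscale image to black and white.

    Returns a binary image in which 1 marks a black (ink) pixel and
    0 marks a white (background) pixel. The threshold is an assumed
    default, not a value taken from the described method.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


gray = [
    [255, 250, 30, 20],
    [240, 10, 15, 245],
]
binary = threshold_grayscale(gray)
```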

For the black and white images, the method smears the
image 308. Within an arbitrary string of black and white
pixels the method replaces white pixels with black pixels
if the number of adjacent white pixels between two black
pixels is less than a predetermined constant. This constant
is related to the font size and can be user-defined. This
operation is primarily engaged in the horizontal direction.
The operation closes the gaps that may exist between
different letters in a word and reduces a word to a
rectangular block of black pixels. However, it also affects
the line drawings in a similar fashion. The difference here
is that by the very nature of their appearance, text words
after the operation look rectangular, of a certain height
(for horizontal text) and width (assuming that the part
numbers that appear in an engineering drawing are likely to
be of a certain length). However, the line drawings
generate irregular patterns, making them discernible from
the associated text.
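The horizontal smearing rule above can be sketched as follows. The sketch assumes a binary image stored as nested lists with 1 for black and 0 for white; the function names and the `max_gap` parameter (standing in for the predetermined constant) are illustrative only.

```python
def smear_row(row, max_gap):
    """Fill a run of white pixels (0) with black (1) when the run lies
    between two black pixels and is shorter than max_gap."""
    out = list(row)
    last_black = None
    for i, pixel in enumerate(row):
        if pixel != 1:
            continue
        if last_black is not None:
            gap = i - last_black - 1
            if 0 < gap < max_gap:
                for j in range(last_black + 1, i):
                    out[j] = 1
        last_black = i
    return out


def smear_horizontal(image, max_gap):
    """Apply the smearing rule to every row of a binary image."""
    return [smear_row(row, max_gap) for row in image]


row = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
smeared = smear_row(row, max_gap=3)
# The two-pixel gap is closed; the four-pixel gap is left open.
```

Because leading and trailing white runs are not bounded by black pixels on both sides, they are never filled, which keeps separate words from merging with the page margin.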
The method cleans the resultant image by using median
filtering 309 to remove small islands or groups of black
pixels. The method groups the horizontal runs of black
pixels into groups separated by white space and associates
labels with them 310. The method computes a bounding box 311
for each group and computes such features as width, height,
aspect ratio and the pixel density, e.g., the ratio of the
number of black pixels to the area of the bounding box.
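The grouping and feature computation can be sketched as follows. The sketch substitutes a plain 4-connected flood fill for the run-based grouping described above, which is an assumption made to keep the example short; the feature definitions (width, height, aspect ratio, density) follow the text.

```python
from collections import deque


def label_regions(image):
    """Group black pixels (1) into 4-connected regions.
    Returns a list of regions, each a list of (row, column) pixels."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions


def region_features(pixels):
    """Bounding box, width, height, aspect ratio and pixel density."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    return {
        "bbox": (min(xs), min(ys), max(xs), max(ys)),  # left, top, right, bottom
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "density": len(pixels) / (width * height),
    }


image = [
    [1, 1, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
]
features = [region_features(region) for region in label_regions(image)]
```

A smeared word shows up as a wide, dense rectangle (here the 4 x 2 block with density 1.0), while stray pixels produce tiny regions that later rules can discard.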

The method implements rules 312 to determine whether
there is text inside the bounding box and if so, whether
the text is of interest. The method rules out regions that
are either too big or too small using a threshold
technique. The method searches for a word or two that makes
up an identifier, such as a part number or part name. The
method also rules out regions that are square in nature
rather than rectangular as defined by the aspect ratio,
as normally words are several characters long and
have a height of one character. The method also rules out
regions that are relatively empty, e.g., the black pixels
are connected in a rather irregular, non-rectangular way.
This is a characteristic of line drawings and is unlikely
to be associated with text strings. The limits in the above
are domain dependent and the user has the ability to choose
and modify them based on the characteristics of the
document processed.

After the plausible text areas have been identified,
the method uses an OCR toolkit 313 to identify the ASCII
text that characterizes the plausible regions identified
above. Once the method has determined the text, a pattern
matching method is used 314 to correct for errors that may
have been made by the OCR during recognition. For example,
the OCR may have erroneously substituted the letter "o" for
the numeral "0". If the method is aware of the context,
such errors can be rectified.
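The region rules and the context-aware OCR correction can be sketched together. Every numeric limit below is an assumed placeholder, since the text states that the limits are domain dependent and user adjustable. The part-number format (two capital letters, a hyphen, four digits) is hypothetical, and the "l"/"I" to "1" substitutions are an added assumption beyond the "o" for "0" case mentioned above.

```python
import re


def is_plausible_text_region(width, height, density,
                             min_area=20, max_area=5000,
                             min_aspect=1.5, min_density=0.5):
    """Apply size, shape and density rules to one bounding box.
    All thresholds are illustrative defaults, not values from the text."""
    area = width * height
    if area < min_area or area > max_area:
        return False  # too small or too big
    if width / height < min_aspect:
        return False  # roughly square rather than word-shaped
    if density < min_density:
        return False  # relatively empty, typical of line drawings
    return True


# Hypothetical context: part numbers look like "AB-1030".
PART_NUMBER = re.compile(r"^[A-Z]{2}-\d{4}$")
# Letters that OCR may produce in place of digits (assumed list).
DIGIT_LOOKALIKES = str.maketrans({"o": "0", "O": "0", "l": "1", "I": "1"})


def correct_part_number(token):
    """Repair digit look-alikes in the numeric part of a token.
    Returns the corrected token, or None if it still does not fit."""
    prefix, sep, suffix = token.partition("-")
    if not sep:
        return None
    candidate = prefix + "-" + suffix.translate(DIGIT_LOOKALIKES)
    return candidate if PART_NUMBER.match(candidate) else None


corrected = correct_part_number("AB-1o3O")
```

Restricting the substitution to the numeric field is what makes the correction context sensitive: the same letters are left untouched in the alphabetic prefix.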

The method keeps words and/or phrases of interest and
saves them in an AIU file. Once the method has extracted
and saved the text of interest, object parts, if any, are
identified within the images 316.

To increase the speed of the method, the non-text
regions of the image are parsed into blocks. A histogram of
the pixel gray level or color values in these blocks
317-318 is then analyzed. For a color image, the method
analyzes a histogram for the whole image.

The method implements templates of objects that are
being searched for in the image. The method parses the
template into blocks and determines a histogram for the
blocks. The method determines locations in the original
image of blocks that have a similar histogram signature as
that of the template. Upon determining a match 319, the
method performs a more thorough pixel correlation 320 to
determine the exact location.

The method can begin at a low resolution, for
example, using 32x32 blocks. If a match is found, the
method can reiterate at a higher resolution, e.g., 16x16.
After the reiteration to a scale of, for example, 8x8, the
method correlates the template with the original to find a
location of a desirable match. However, before performing
a correlation, the method binarizes the image 321, if it is
not already in binary form, by computing edges. For the
binarized image, the method performs a correlation for the
edges. Thus, the method can reduce the amount of processing
needed to process an image.
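The idea of matching histogram signatures coarsely and then refining can be sketched as follows. This is a simplified variant, not the procedure described above: it refines the grid step of the search (8, 4, 2, 1 pixels) rather than the block size (32x32, 16x16, 8x8), it uses an L1 distance between four-bin histograms, and it omits the final pixel correlation. All names and parameters are assumptions for illustration.

```python
def histogram(image, top, left, size, bins=4, max_value=256):
    """Normalized gray-level histogram of a size x size window."""
    counts = [0] * bins
    for y in range(top, top + size):
        for x in range(left, left + size):
            counts[image[y][x] * bins // max_value] += 1
    total = size * size
    return [c / total for c in counts]


def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (0 = identical)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def coarse_to_fine_search(image, template, steps=(8, 4, 2, 1)):
    """Find the window whose histogram best matches the template's.

    The first pass scans the whole image on a coarse grid. Each later
    pass rescans only the neighbourhood of the best position found so
    far, using a finer grid step.
    """
    size = len(template)
    target = histogram(template, 0, 0, size)
    max_top = len(image) - size
    max_left = len(image[0]) - size
    tops = range(0, max_top + 1, steps[0])
    lefts = range(0, max_left + 1, steps[0])
    best = None  # (distance, top, left)
    for i, step in enumerate(steps):
        if i > 0:
            prev = steps[i - 1]
            tops = range(max(0, best[1] - prev), min(max_top, best[1] + prev) + 1, step)
            lefts = range(max(0, best[2] - prev), min(max_left, best[2] + prev) + 1, step)
        for top in tops:
            for left in lefts:
                d = histogram_distance(histogram(image, top, left, size), target)
                if best is None or d < best[0]:
                    best = (d, top, left)
    return best[1], best[2]


# A white 16 x 16 image with a dark 4 x 4 patch at row 9, column 6.
image = [[255] * 16 for _ in range(16)]
for y in range(9, 13):
    for x in range(6, 10):
        image[y][x] = 0
template = [[0] * 4 for _ in range(4)]
location = coarse_to_fine_search(image, template)  # (9, 6)
```

Only a small fraction of window positions is examined at the finest step, which is the source of the speed-up; a histogram match alone does not guarantee the right object, so a correlation step would still be needed afterwards.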

Matches are determined using a threshold 323, which
can be set at 0.6 × N_e, where N_e is the number of edge
points in the template. The method determines the
information needed both for the text and non-text portion
of the PDF files and the assimilated information is stored
in AIU files 324-325 using a Standard Generalized Markup
Language (SGML). SGML syntax can be used to create
hyperlinks to other parts of the same document, or to other
documents or non-similar media types.
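The match threshold of 0.6 × N_e mentioned above can be illustrated with a minimal sketch. It assumes binary edge maps stored as nested lists (1 = edge point) and treats a score equal to the threshold as a match; whether the comparison is strict is not stated in the text.

```python
def edge_match_score(image_edges, template_edges, top, left):
    """Count template edge points that coincide with image edge points
    when the template is placed at (top, left)."""
    score = 0
    for y, row in enumerate(template_edges):
        for x, value in enumerate(row):
            if value == 1 and image_edges[top + y][left + x] == 1:
                score += 1
    return score


def is_match(image_edges, template_edges, top, left, fraction=0.6):
    """Accept the placement if at least fraction * N_e edge points agree,
    where N_e is the number of edge points in the template."""
    n_e = sum(sum(row) for row in template_edges)
    return edge_match_score(image_edges, template_edges, top, left) >= fraction * n_e


template_edges = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
image_edges = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
aligned = is_match(image_edges, template_edges, 1, 1)
shifted = is_match(image_edges, template_edges, 0, 0)
```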

According to an embodiment of the present invention,
the structure of PDF documents is defined in SGML. The
structural information can be used to capture the
information extracted from a PDF. The objects that are
extracted from the PDF are termed Anchorable Information
Units (AIUs). Since information extracted from a PDF
document is represented as an instance of the PDF AIU
Document Type Definition (DTD), and is thus well structured,
the method can perform automatic hyperlinking between the
PDF documents and other types of documents. Therefore,
when the user clicks on the object during browsing, the
appropriate link can be navigated to reach the desired
destination.

After processing, each PDF file is associated with an
AIU file, which includes relevant information extracted
from the PDF file. The AIU file is defined in a
hierarchical manner as follows:

At the root the AIUDoc definition encompasses the header,
footer and the extracted information within the PdfDocX
field.



<!ELEMENT AIUDoc - - (DocHeader, PdfDocX, DocFooter)>
<!ATTLIST AIUDoc
    Id      CDATA   #IMPLIED
    Type    CDATA   #IMPLIED
    Name    CDATA   #IMPLIED
>



The definition of the DocHeader is given as:

<!ELEMENT DocHeader - - (DocType, DocDesc)>
<!ATTLIST DocHeader
    Id      CDATA   #IMPLIED
    Type    CDATA   #IMPLIED
    Name    CDATA   #IMPLIED
    File    CDATA   #IMPLIED
>



and the fields in the PdfDocX are given by (these fields
will be defined below):

<!ELEMENT PdfDocX - - ((PdfObject | PdfAIU)*)>
<!ATTLIST PdfDocX
    Id      CDATA   #IMPLIED
>



The PdfSeg field, which characterizes the sections, is
defined as:

<!ELEMENT PdfSeg - - ((PdfSeg | PdfAIU)*)>
<!ATTLIST PdfSeg
    Id      CDATA   #IMPLIED
>

while the PdfSeg2 fields, which are the segments in this
document, are defined by:

<!ELEMENT PdfSeg2 - - (PdfAIU*)>
<!ATTLIST PdfSeg2
    Id              CDATA   #IMPLIED
    StartLocation   CDATA   #IMPLIED
    EndLocation     CDATA   #IMPLIED
>

The AIUs are defined using the following fields:

<!ELEMENT PdfAIU - - (Link*)>
<!ATTLIST PdfAIU
    Id              CDATA   #IMPLIED
    Type            CDATA   #IMPLIED
    Name            CDATA   #IMPLIED
    BoundaryCoords  CDATA   #IMPLIED
>



Thus, an AIU file is a sequence of one or more items of
parsable character data. In the example, the character data
includes a string of ASCII characters and numbers. While
various attributes relevant to PDF AIUs are listed above,
additional attributes can be relevant for AIUs related to
other media types. As mentioned before, the method
structures the PDF document in a hierarchical manner. At
the root is the entire document. The document is broken up
into sub-documents. The AIU file starts with a description
of the type of the underlying media type, which in this
case is PDF. The document header includes four different
fields including the underlying PDF file name, a unique
identifier for the whole PDF file, a document type
definition, which explains the context of the PDF file, and
a more specific document description explaining the content
of the PDF file.

The information extracted from the PDF file is stored
within the PdfDocX structure. The PdfDocX structure
includes a unique identifier derived from the identifier of
the PDF file itself. The PDF document is organized in a
hierarchical manner using sub-documents and segments. The
segments have the following attributes. Once again, there
is a unique identifier for each segment. The start and end
locations of these segments define the extent of these
sections. Based on the needs and size of the document,
further attributes can be used as well.

The PDF AIUs include a unique identifier. The PDF AIUs 
can be of the following types: rectangle, ellipse and 
polygon. Each AIU also has a unique name. The 
BoundaryCoords field describes the coordinates of the 
underlying object of interest and defines the bounding box. 
The page field describes the page location of the 
underlying document. In case of rectangles and ellipses, 
the upper left and lower right corners of the bounding box 
are defined. In case of a polygon, all the nodes are 
defined. 
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An example of a PDFAIU file is given below. The link 
definition is described in the following subsection. 

<AIUDoc>
  <DocHeader Type="Pdf" File="test.aiu" ID="Ntest">
    <DocType>Training</DocType>
    <DocDesc>Overview of test engine</DocDesc>
  </DocHeader>
  <PdfDocX Id="IDV942">
    <PdfSeg Id="section1">
      <PdfSeg2 Id="IDV942P1" StartLocation="0" EndLocation="20">
      </PdfSeg2>
      <PdfSeg2 Id="IDV942P2" StartLocation="21" EndLocation="50">
      </PdfSeg2>
    </PdfSeg>
    <PdfAIU Id="PAIU01" Type="rectangle" Name="object1"
            Page="2" BoundaryCoords="66 100 156 240">
    </PdfAIU>
    <PdfAIU Id="PAIU02" Type="ellipse" Name="object2" Page="8"
            BoundaryCoords="100 156 240 261">
    </PdfAIU>
    <PdfAIU Id="PAIU03" Type="polygon" Name="object1" Page="22"
            BoundaryCoords="438 81 411 88 397 102 383 138 406 185 480
            175 493 122 465 89 438 81">
    </PdfAIU>
  </PdfDocX>
  <DocFooter></DocFooter>
</AIUDoc>
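
As an illustration only, an AIU file of the kind shown above could be read with a few lines of Python. The sketch below is not part of the described system. It assumes that a cleaned-up AIU file is well-formed enough to be parsed as XML (the description says only that an SGML structure is used), and the names `PdfAIU`, `parse_aius` and `SAMPLE_AIU` are hypothetical. It decodes each BoundaryCoords string into a list of (x, y) points: two corner points for a rectangle or ellipse, and all nodes for a polygon.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple

# Cleaned-up AIUs from the example; parsing as XML is an assumption,
# since the description only states that an SGML structure is used.
SAMPLE_AIU = """<AIUDoc>
<DocHeader Type="Pdf" File="test.aiu" ID="Ntest">
<DocType>Training</DocType>
<DocDesc>Overview of test engine</DocDesc>
</DocHeader>
<PdfDocX Id="IDV942">
<PdfAIU Id="PAIU01" Type="rectangle" Name="object1" Page="2"
 BoundaryCoords="66 100 156 240"></PdfAIU>
<PdfAIU Id="PAIU02" Type="ellipse" Name="object2" Page="8"
 BoundaryCoords="100 156 240 261"></PdfAIU>
<PdfAIU Id="PAIU03" Type="polygon" Name="object1" Page="22"
 BoundaryCoords="438 81 411 88 397 102 383 138 406 185 480
 175 493 122 465 89 438 81"></PdfAIU>
</PdfDocX>
<DocFooter></DocFooter>
</AIUDoc>"""


@dataclass
class PdfAIU:
    """Hypothetical in-memory form of one PdfAIU element."""
    aiu_id: str
    shape: str                      # "rectangle", "ellipse" or "polygon"
    name: str
    page: int
    points: List[Tuple[int, int]]   # corner points or polygon nodes


def parse_aius(text: str) -> List[PdfAIU]:
    """Collect every PdfAIU element and decode its BoundaryCoords."""
    root = ET.fromstring(text)
    aius = []
    for el in root.iter("PdfAIU"):
        # split() with no argument also handles the line break inside
        # a long polygon coordinate list.
        nums = [int(tok) for tok in el.get("BoundaryCoords", "").split()]
        points = list(zip(nums[0::2], nums[1::2]))  # pair up x and y values
        aius.append(PdfAIU(el.get("Id"), el.get("Type"), el.get("Name"),
                           int(el.get("Page")), points))
    return aius
```

Calling `parse_aius(SAMPLE_AIU)` yields three records. The rectangle carries its upper left and lower right corners, while the polygon carries every node, including the repeated closing node.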

Hyperlinking for the PDF AIUs can be done manually or 
in an automatic fashion. Manual links can be inserted 
during the AIU outlining phase described before. However, 
according to an embodiment of the present invention, since 
the information extracted from PDF is stored in 
well-structured AIU files, the method includes an Automatic 
Hyperlinker to automatically hyperlink PDF AIUs with all 
other types of documents based on Hyperlink Specifications. 
That is, the Hyperlinker processes link specifications, 
performs pattern matching on the contents and structures of 
the documents, and establishes links between sources and 
destinations. Also important is how the link information is 
encoded within the AIU files. Each of the objects encoded 
can potentially have a link. Since the SGML structure has 
been adopted for the AIU files and links are entities 
within that file, links are also defined using a similar 
SGML structure. The definition and the fields are given 
below: 

<!ELEMENT Link - - ((#PCDATA)+)>
<!ATTLIST Link
    LinkId      CDATA  #IMPLIED
    Type        CDATA  #IMPLIED
    SubType     CDATA  #IMPLIED
    Linkend     CDATA  #IMPLIED
    Book        CDATA  #IMPLIED
    Focus       CDATA  #IMPLIED
    LinkRuleId  CDATA  #IMPLIED
>

The Type defines the type of the destination, e.g., 
whether it is text, image, video, etc. Focus defines the 
text that is highlighted at the link destination. Book 
represents the book that the destination is part of. In the 
example, since the main application is a hyperlinked 
manual, the documents are organized as a hierarchical tree, 
where each manual is represented as a book. Linkend, the 
most important attribute, contains the destination 
information. LinkId is an index into the database if the 
destination points to the database. LinkRuleId indicates 
which rule created this link. SubType is similar to the 
Type definition in the AIU specification above. Labels give 
a description of the link destination. There can be other 
attributes as well. 
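
The attribute list can be mirrored by a small record type. The following Python sketch is illustrative only and is not taken from the described implementation. Because every attribute is declared #IMPLIED, each one may be omitted, so every field is optional. The class name `Link`, its field names and the method `to_element` are hypothetical; the attribute spellings LinkId and LinkRuleId are the reading assumed here, and Labels is included because it appears in the description, although it is not listed in the definition above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    """Hypothetical record for one Link element; all attributes optional."""
    link_id: Optional[str] = None       # index into the link database
    type: Optional[str] = None          # destination type, e.g. "Text", "Image"
    sub_type: Optional[str] = None
    linkend: Optional[str] = None       # destination information
    book: Optional[str] = None          # book containing the destination
    focus: Optional[str] = None         # text highlighted at the destination
    link_rule_id: Optional[str] = None  # rule that created the link
    labels: Optional[str] = None        # description of the destination

    def to_element(self) -> ET.Element:
        """Build a Link element, leaving out attributes that are unset."""
        attrs = {
            "LinkId": self.link_id,
            "Type": self.type,
            "SubType": self.sub_type,
            "Linkend": self.linkend,
            "Book": self.book,
            "Focus": self.focus,
            "LinkRuleId": self.link_rule_id,
            "Labels": self.labels,
        }
        return ET.Element("Link", {k: v for k, v in attrs.items()
                                   if v is not None})
```

For instance, `Link(type="Text", sub_type="ID", linkend="N13509426").to_element()` produces an element carrying only the three attributes that were set.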

In the following, an instance of a hyperlinked AIU 
file is provided. That is, Link elements can be manually 
or automatically added to PDF AIUs that are to be 
hyperlinked to their destinations during playback. 

<AIUDoc>
  <DocHeader Type="Pdf" File="test.aiu" ID="Ntest">
    <DocType>Training</DocType>
    <DocDesc>Overview of test engine</DocDesc>
  </DocHeader>
  <PdfDocX Id="IDV942">
    <PdfSeg Id="section1">
      <PdfSeg2 Id="IDV942P1" StartLocation="0" EndLocation="20">
      </PdfSeg2>
      <PdfSeg2 Id="IDV942P2" StartLocation="21" EndLocation="50">
      </PdfSeg2>
    </PdfSeg>
    <PdfAIU Id="PAIU01" Type="rectangle" Name="object1"
            Page="2" BoundaryCoords="66 100 156 240">
      <Link Type="Text" SubType="ID" LinkId="7001"
            Linkend="N13509426" Book="31"
            Labels="Text Document in Vol 3.1">
      </Link>
    </PdfAIU>
    <PdfAIU Id="PAIU02" Type="ellipse" Name="object2" Page="8"
            BoundaryCoords="100 156 240 261">
      <Link Type="Image" SubType="ID" LinkId="7001"
            Linkend="N13509426" Book="31"
            Labels="Image Description">
      </Link>
    </PdfAIU>
    <PdfAIU Id="PAIU03" Type="polygon" Name="object1" Page="22"
            BoundaryCoords="438 81 411 88 397 102 383 138 406 185 480
            175 493 122 465 89 438 81">
    </PdfAIU>
  </PdfDocX>
  <DocFooter></DocFooter>
</AIUDoc>
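
To illustrate how a viewer or link manager might consume such a file, the Python sketch below builds a table from AIU identifier to the attributes of its Link element. It is a sketch under stated assumptions, not the described implementation: the file is assumed to be parseable as XML, only two AIUs from the example are reproduced, and the names `links_by_aiu` and `HYPERLINKED_AIU` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Two AIUs from the example: one with a Link element and one without.
# Treating the SGML as XML is an assumption made for this sketch.
HYPERLINKED_AIU = """<AIUDoc>
<PdfDocX Id="IDV942">
<PdfAIU Id="PAIU01" Type="rectangle" Name="object1" Page="2"
 BoundaryCoords="66 100 156 240">
<Link Type="Text" SubType="ID" LinkId="7001" Linkend="N13509426"
 Book="31" Labels="Text Document in Vol 3.1"></Link>
</PdfAIU>
<PdfAIU Id="PAIU03" Type="polygon" Name="object1" Page="22"
 BoundaryCoords="438 81 411 88 397 102 383 138 406 185 480
 175 493 122 465 89 438 81"></PdfAIU>
</PdfDocX>
</AIUDoc>"""


def links_by_aiu(text: str) -> Dict[str, Dict[str, str]]:
    """Map each AIU Id to its Link attributes; unlinked AIUs are skipped."""
    root = ET.fromstring(text)
    table: Dict[str, Dict[str, str]] = {}
    for aiu in root.iter("PdfAIU"):
        link = aiu.find("Link")   # direct child only
        if link is not None:
            table[aiu.get("Id")] = dict(link.attrib)
    return table
```

An AIU without a Link element, such as PAIU03, simply has no entry in the table, which matches the statement that each object can potentially, but need not, have a link.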



The SGML documents (including the AIU files) are 
preindexed using the SGML Indexer. Preindexing produces a 
dictionary listing every SGML element in the order in which 
the elements appear in the documentation, together with an 
index into that dictionary. Based on the user-defined link 
specifications, links are created using pattern matching on 
these dictionary files. For PDF AIUs, links can be created 
to and from them in this way. The main point to note about 
the hyperlinker is that the method is able to use this 
machinery within the PDF AIU authoring system because the 
PDF information is structured using the AIU specification 
language as explained before. This also allows the method 
to implement a hyperlink management system that can 
incrementally update link rules. This is done by the link 
manager software, which uses the link database to keep 
track of link rule changes by means of time stamps. 
Incremental hyperlinking is done either by changing 
existing link specifications or by adding extra link 
specifications. When new link specifications are added, the 
hyperlinker executes the new link specifications on all 
documents and adds new links without destroying the old 
ones. When a link becomes obsolete, the old links are 
removed based on the Id of the old link. A similar 
procedure is adopted when adding new links. 
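
The incremental update described above can be sketched as follows. This Python example is a simplified illustration under several assumptions and is not the described link manager: a link specification is reduced to a regular expression plus a destination, the indexed dictionary is reduced to a mapping from element identifier to element text, and the names `LinkRule`, `LinkManager` and `update` are hypothetical. Rules whose time stamp is unchanged are skipped, new or changed rules are executed on all elements, and links belonging to removed rules are deleted by rule Id.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LinkRule:
    """Hypothetical, simplified link specification."""
    rule_id: str
    pattern: str     # regular expression matched against element text
    linkend: str     # destination assigned to every match
    timestamp: int   # time of the last change to this rule


@dataclass
class LinkManager:
    # rule_id -> time stamp of the rule version that was last executed
    applied: Dict[str, int] = field(default_factory=dict)
    # each link is (rule_id, source element id, linkend)
    links: List[Tuple[str, str, str]] = field(default_factory=list)

    def update(self, rules: List[LinkRule], elements: Dict[str, str]) -> None:
        current = {r.rule_id for r in rules}
        # Remove links created by rules that no longer exist.
        self.links = [l for l in self.links if l[0] in current]
        for gone in set(self.applied) - current:
            del self.applied[gone]
        for rule in rules:
            if self.applied.get(rule.rule_id) == rule.timestamp:
                continue  # unchanged rule: keep its existing links
            # New or changed rule: replace only this rule's links.
            self.links = [l for l in self.links if l[0] != rule.rule_id]
            for source_id, text in elements.items():
                if re.search(rule.pattern, text):
                    self.links.append((rule.rule_id, source_id, rule.linkend))
            self.applied[rule.rule_id] = rule.timestamp
```

Keeping the rule Id on every link is what allows a single rule's links to be withdrawn or regenerated without touching links created by other rules.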

After the hyperlinking has been achieved, it is 
important to be able to get the desired functionality while 
viewing. The current implementation modifies Adobe Acrobat® 
Reader™ and uses special-purpose software to achieve 
interprocess communication via a link manager. When the 
viewer is given a command to load a certain PDF file, while 
loading it, it also looks to see if an AIU file is 
available for that file. If so, the AIU file is loaded 
along with the original file. For each entry in the AIU 
file, a boundary is drawn around the object of interest. If 
the user clicks on any of the objects, the viewer 
communicates with the link manager with the appropriate 
Link Identifier. The Link Manager then executes the link 
destination. Often within a multimedia documentation 
environment, this means jumping to a particular point of 
the text or showing a detailed image of the object in 
question. In that case the SGML browser jumps to that point 
in the SGML document. 
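
Deciding whether a user click falls on an object of interest amounts to a point-in-shape test against the AIU boundary. The Python sketch below shows one way this could be done for the three AIU types. It is an assumption-laden illustration, not the modified viewer's actual logic: the function name `hit` is hypothetical, rectangles and ellipses are taken to be axis-aligned within their two-corner bounding box, the bounding box is assumed to have non-zero width and height, and polygons are tested with the standard ray-casting (even-odd) rule.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def hit(shape: str, coords: List[Point], p: Point) -> bool:
    """Return True if point p lies inside the AIU boundary."""
    x, y = p
    if shape in ("rectangle", "ellipse"):
        (xa, ya), (xb, yb) = coords          # two corners of the bounding box
        x0, x1 = min(xa, xb), max(xa, xb)
        y0, y1 = min(ya, yb), max(ya, yb)
        if shape == "rectangle":
            return x0 <= x <= x1 and y0 <= y <= y1
        # Ellipse inscribed in the bounding box (assumed non-degenerate).
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        rx, ry = (x1 - x0) / 2, (y1 - y0) / 2
        return ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1
    if shape == "polygon":
        # Ray casting: count edges crossed by a horizontal ray from p.
        inside = False
        n = len(coords)
        for i in range(n):
            (xa, ya), (xb, yb) = coords[i], coords[(i + 1) % n]
            if (ya > y) != (yb > y):         # edge straddles the ray's height
                x_cross = xa + (y - ya) * (xb - xa) / (yb - ya)
                if x < x_cross:
                    inside = not inside
        return inside
    raise ValueError(f"unknown AIU type: {shape}")
```

A viewer could run this test over the AIUs on the current page and, on the first match, pass that AIU's Link Identifier to the link manager. A repeated closing node in a polygon, as in the example file, is harmless because the resulting zero-length edge never straddles the ray.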

Figure 4 shows a graphical user interface display 
supporting processing of a multimedia data file to provide 
information for use in navigating multimedia data file 
content. User selection of icon 400 permits user 
initiation of generation of a navigation file supporting 
linking of input file elements to external documents by 
parsing and sorting text and image content to identify text 
for incorporation in a navigation file. Further, in 
response to user selection of icon 400, items are activated 
within menus generated upon user selection of a member of 
toolbars 405 and 410. Specifically, a menu permitting user 
selection of an input file and format to be processed is 
generated in response to user selection of icon 415. 

Having described embodiments for a method of 
extracting anchorable information units from PDF files, it 
is noted that modifications and variations can be made by 
persons skilled in the art in light of the above teachings. 
It is therefore to be understood that changes may be made 
in the particular embodiments of the invention disclosed 
which are within the scope and spirit of the invention as 
defined by the appended claims. Having thus described the 
invention with the details and particularity required by 
the patent laws, what is claimed and desired protected by 
Letters Patent is set forth in the appended claims. 



